On Building a Full-Text Digital Library of Historical Documents
Authors
Abstract
The National Taiwan University Library has built a digital library of historical documents about Taiwan. The content is unique in two respects: it covers about 80% of all primary Chinese historical materials about Taiwan before 1895, and all of it is available as searchable full text in addition to metadata. To make these materials more accessible to the research community, we have gone beyond full-text search and retrieval. We treat the set of documents retrieved by a query as a sub-collection, and we have designed post-query classification methods that help users find the inter-relationships among documents and the collective meaning of a sub-collection. We have also developed techniques for term extraction from old Chinese and a data format for representing governmental structures. We hope that our system will help advance research in Taiwanese history and will serve as a model for similar endeavors.
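The idea of treating a query result as a sub-collection and then classifying it can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Document` fields, the `search` and `classify` helpers, and the toy records are hypothetical and do not describe the library's actual data model or algorithms.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    # Hypothetical record layout; the real system's metadata schema is not given.
    doc_id: str
    year: int
    source: str
    text: str


def search(corpus, query):
    """Naive full-text search: keep documents whose text contains the query."""
    return [doc for doc in corpus if query in doc.text]


def classify(sub_collection, key):
    """Post-query classification: group retrieved documents by a metadata key."""
    groups = defaultdict(list)
    for doc in sub_collection:
        groups[key(doc)].append(doc.doc_id)
    return dict(groups)


# Invented toy records, used only to exercise the functions.
corpus = [
    Document("d1", 1723, "memorial", "repairs to the harbor wall"),
    Document("d2", 1723, "gazetteer", "grain shipped from the harbor"),
    Document("d3", 1786, "memorial", "harbor tax petition"),
    Document("d4", 1786, "memorial", "land survey of the northern plain"),
]

# The query result is treated as a sub-collection in its own right.
sub_collection = search(corpus, "harbor")
by_year = classify(sub_collection, lambda d: d.year)
by_source = classify(sub_collection, lambda d: d.source)
```

Grouping the same sub-collection along different metadata axes (year, source type) is one simple way to expose how the retrieved documents relate to one another, rather than presenting them only as a flat ranked list.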
Similar resources
Mining dates from historical documents
The essential quality of information in a digital library is accessibility. Full-text search is not enough for some collections; more can be done. Historical collections, for example, contain dates, and it would be useful to historians to be able to search by them. However, these dates occur anywhere within the text of historical documents, and to be searched they must be extracted from the doc...
Handwritten Text Recognition for Historical Documents
The amount of digitized legacy documents has been rising dramatically over the last years due mainly to the increasing number of on-line digital libraries publishing this kind of documents. The vast majority of them remain waiting to be transcribed into a textual electronic format (such as ASCII or PDF) that would provide historians and other researchers new ways of indexing, consulting and que...
Customizing Digital Library Interfaces with Greenstone
The Greenstone digital library software is intended to help users construct simple collections of information very quickly. Indeed, only a few minutes of the user’s time are needed to set up a collection based on a standard design and initiate the building process. Collections may be large—some comprise Gbytes of text; millions of documents. Furthermore, even larger volumes of information may b...
A Writer Identification System of Greek Historical Documents using MATLAB
In this paper we present a system for writer identification from historical lines of text, where features are extracted and used to recognize individuals. The main goal is to analyze documents of different writing styles in order to identify the writers. We consider a complete 2D probability distribution that takes into account all possible combinations of angle pairs, outperforming original c...
In Codice Ratio: Scalable Transcription of Historical Handwritten Documents
Huge amounts of handwritten historical documents are being published by digital libraries worldwide. However, for these raw digital images to be really useful, they need to be annotated with informative content. State-of-the-art Handwritten Text Recognition (HTR) approaches require an impressive training effort by expert paleographers. Our contribution is a scalable, end-to-end transcription w...